
S3 Vectors

Overview

AWS S3 Vectors allows you to perform semantic and similarity search over vector embeddings stored within S3 vector buckets. You execute similarity searches by specifying the vector bucket, index name, query vector, and parameters like the number of results to return (top K). Metadata filtering can be applied to refine search results by attributes such as genre or timestamp, and results can include matching vectors along with their associated metadata and distance scores.

A general overview of S3 Vectors can be found at https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-vectors.html and limitation information can be found at https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-vectors-limitations.html.

Querying Overview

Query Specification

One way to query S3 Vectors data from Qarbine is to use a JSON-like structure. Below is an example to retrieve up to 10 movies that have some similarity to the given query vector.

{
vectorBucketName: 'media-entertainment',
indexName:"movies",
queryVector: {float32: [! sampleEmbeddings('funny') !]},
topK: 10,
returnDistance: true,
returnMetadata: true,
}

The sampleEmbeddings('funny') macro supplies a pre-calculated embedding for 'funny cartoons with family themes' generated with text-embedding-ada-002. A second sample embedding, under the alias 'scary', is for 'scary movies with mythical villains'.

Note that S3 bucket names and vector index names are case sensitive!

AWS S3 Vectors offers a variety of parameters to control how the comparison is done and what properties are returned. You can query using one or a combination of a semantic (i.e. vector) search and a lexical (i.e. scalar) search. The former is for ‘similar’ oriented searches while the latter uses more traditional matching techniques (similar to a SQL WHERE clause).

Below is an example of a movie object stored in the vector index.

  

Unlike the strictly columnar result rows found in SQL databases, the resulting objects are returned as nested JSON objects. In this example there are first level fields of key, metadata, and distance. The “key” field is the S3 Vectors object identifier.

General Querying

The primary query specification options are described below.

Field | Description
vectorBucketName | The name of the S3 Vectors bucket to access.
indexName | The name of the index to query.
queryVector | The embedding value in “[...]” or “{float32: [...] }” format. S3 Vectors currently only supports float32 values. Qarbine automatically converts any queryVector value which is an array of numbers into “{float32: [...] }” format.
filter | The filter specification. See below for more details.
topK | The maximum number of matches to return. It is similar to a SQL LIMIT clause. The default is 30.
returnProperties | Which properties of the object to return in the answer set. This can be used to reduce the answer set size provided by S3 Vectors. Specify the metadata fields wanted. If you want the distance then include that in the list. If you want the underlying vector then include ‘data’ in the list.
returnMetadata | A boolean indicating whether to return any metadata. The default is true.
returnData | A boolean indicating whether the row vectors should be included in the answer set. The default is false to reduce the answer set size.
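The queryVector conversion described in the table can be pictured with a short Python sketch. This is only an illustration of the described behavior, not Qarbine's code; the function name normalize_query_vector and the rejection of empty vectors are assumptions made for the example.

```python
from typing import Dict, List, Union

QueryVector = Union[List[float], Dict[str, List[float]]]


def normalize_query_vector(value: QueryVector) -> Dict[str, List[float]]:
    """Return the vector in {"float32": [...]} form (hypothetical helper)."""
    if isinstance(value, dict):
        # Only the float32 wrapper is accepted, mirroring the table above.
        if set(value.keys()) != {"float32"}:
            raise ValueError("only float32 vectors are supported")
        numbers = value["float32"]
    else:
        numbers = value
    if not numbers:
        # Assumption: an empty vector is treated as an authoring error.
        raise ValueError("the query vector must not be empty")
    return {"float32": [float(n) for n in numbers]}


print(normalize_query_vector([1, 2.5, 3]))
```

Either accepted form ends up in the same wrapped shape, so the rest of a query specification can be written without caring which one the author used.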

Fetching Specific Keys

To retrieve specific objects based on keys use these options.

Field | Description
vectorBucketName | The name of the S3 Vectors bucket to access.
indexName | The name of the index to query.
keys | The list of vector keys.
returnProperties | Which properties of the object to return in the answer set. See above for more details.
returnMetadata | A boolean indicating whether to return any metadata. The default is true.
returnData | A boolean indicating whether the row vectors should be included in the answer set. The default is false to reduce the answer set size.

Managing Answer Set Shape

The main Data Source Designer guide provides details on using Qarbine pragmas to manipulate answer row shapes. For example, the query specification

{
vectorBucketName: 'media-entertainment',
indexName:"movies",
queryVector: {float32: [! sampleEmbeddings('funny') !] },
topK: 10,
returnDistance: true,
returnMetadata: true,
}

returns this answer set.

  

A sample movie row looks like the following

  

Adding a Qarbine pragma

#pragma pullFieldsUp metadata
{
vectorBucketName: 'media-entertainment',
indexName:"movies",
queryVector: {float32: [! sampleEmbeddings('funny') !] },
topK: 10,
returnDistance: true,
returnMetadata: true,
}

results in an easier answer set to review as shown below in 2 half snippets.

  

  

Accessing the row fields and using them within formulas is also much easier without the added ‘metadata’ object level.
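The effect of pulling the metadata fields up can be sketched in a few lines of Python. This is an approximation of the described result only, not the pragma's implementation; the function name pull_fields_up, the sample row, and the rule for name clashes are assumptions.

```python
from typing import Any, Dict


def pull_fields_up(row: Dict[str, Any], field: str = "metadata") -> Dict[str, Any]:
    """Return a copy of row with the nested object's fields at the top level."""
    nested = row.get(field)
    if not isinstance(nested, dict):
        return dict(row)
    flat = {name: value for name, value in row.items() if name != field}
    # Assumption: on a name clash the existing top-level value is kept.
    for name, value in nested.items():
        flat.setdefault(name, value)
    return flat


# Hypothetical row in the nested shape returned by a query.
row = {"key": "movie_11", "distance": 0.21,
       "metadata": {"title": "Example", "year": 1999}}
print(pull_fields_up(row))
```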

Managing Answer Set Size

AWS S3 Vectors limits the maximum number of matches that a query can return. See the AWS documentation for the latest details. If a topK (AKA “limit”) is not set in the query specification, then the data source’s maximum objects setting is applied.

The default maximum number of rows starts off at 25 for a new data source. This is useful to evolve a query from a concept to one that you have verified returns the desired answer set. As noted, any native way of limiting an answer set size is the preferred approach. This setting is in the component dialog as shown below and also accessible by clicking the ‘Gear’ icon.

  

Once you are done drafting you can adjust this parameter. A “0” indicates there is no maximum. A number greater than 0 limits the final answer set size to that number of rows. This answer set truncation comes after any native query limit. So, if the answer set from the data endpoint is quite large or time-consuming to create, that content still has to be returned to the Qarbine host, which may then truncate the number of rows. It is best to limit at the query level (i.e., use topK: or SQL LIMIT) to reduce the content sent from the data endpoint to the Qarbine host in the first place.
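The ordering of the two limits can be illustrated with a small Python sketch. It only models the rule stated above (0 means no maximum, truncation happens after the endpoint has returned its rows); the function name apply_max_rows is hypothetical.

```python
from typing import List, TypeVar

T = TypeVar("T")


def apply_max_rows(rows: List[T], max_rows: int) -> List[T]:
    """Truncate rows already returned by the endpoint; 0 means no maximum."""
    if max_rows < 0:
        raise ValueError("max_rows must be 0 or greater")
    if max_rows == 0:
        return list(rows)
    return rows[:max_rows]


# Suppose the endpoint returned 30 rows (for example with topK: 30).
endpoint_rows = list(range(30))
print(len(apply_max_rows(endpoint_rows, 25)))  # truncated on the host side
print(len(apply_max_rows(endpoint_rows, 0)))   # no truncation
```

All 30 rows still travel from the endpoint to the host in both cases, which is why a smaller topK is the cheaper way to get a smaller answer set.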

Adjusting the Maximum Rows

Recall the default maximum rows at the component level is 25. When you are satisfied with your query you can change that setting by clicking.

  

Adjust the setting to “0” indicating no Qarbine answer set truncation.

  

Click

  

“Similarity” Strategies

When using AWS S3 Vectors to perform similarity searches, the service returns a list of nearest neighbors based on your query vector. Each result can optionally include a ‘distance’ value, which quantifies how similar the result is to your query (lower values generally mean more similar). If you only want results that are “not too far” from the query (i.e., only sufficiently similar content), there is no S3 Vectors API option for a distance threshold. The only control is the topK setting, which is independent of distance.

Qarbine pragmas can be used to filter the S3 Vectors answer set based on its criteria including the distance value. Consider this query specification

select *, distance
from 'media-entertainment.movies'
where nearText("funny", 'myOpenAI')
limit 20

It returns these 20 rows.

  
  

To apply a secondary distance filter use the runPostQuery pragma as shown below.

#pragma runPostQuery select * from data where distance < 0.222
select *, distance
from 'media-entertainment.movies'
where nearText("funny", 'myOpenAI')
limit 20

The “data” in the pragma line refers to a virtual answer set and has nothing to do with the “data” field of an S3 Vectors row.

The answer set is shown below.

  

A maximum distance, such as the 0.222 used above, is a threshold you set to decide which vector search results are “close enough” to your query to be considered relevant. Determining the ideal value requires both an understanding of your chosen distance metric (Cosine or Euclidean) and testing with your specific dataset and use case.

Cosine distance usually yields values between 0 (identical) and 2 (opposite); typical “similar” matches have a cosine distance much closer to 0. Euclidean distance, by contrast, scales with the number of vector dimensions and the magnitude of the values; smaller values indicate greater similarity.
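To make the two ranges concrete, here is a small Python sketch using the textbook definitions of both metrics. The function names are illustrative, and the exact formula the service uses when reporting a distance is not confirmed here, so treat the numbers as a guide to scale rather than as values S3 Vectors would return.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity: 0 for same direction, 2 for opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance; grows with dimension count and magnitude."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(cosine_distance([1, 0], [1, 0]))    # same direction
print(cosine_distance([1, 0], [-1, 0]))   # opposite direction
print(euclidean_distance([0, 0], [3, 4]))
```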

The optimal values depend on your embedding model, the diversity in your dataset, and how “similar” you want returned matches to be. It is an iterative process of reviewing the result details and sometimes using visualizations as well.
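One way to explore candidate thresholds offline is to apply several of them to the same answer set and compare what survives. The Python sketch below mimics the effect of the post-query distance filter on rows already retrieved; the function name and the sample rows are made up for illustration.

```python
from typing import Any, Dict, List


def filter_by_max_distance(rows: List[Dict[str, Any]],
                           max_distance: float) -> List[Dict[str, Any]]:
    """Keep rows whose distance is strictly below the threshold."""
    return [row for row in rows
            if row.get("distance") is not None and row["distance"] < max_distance]


# Hypothetical rows, already in closest-first order.
rows = [
    {"key": "movie_1", "distance": 0.18},
    {"key": "movie_2", "distance": 0.221},
    {"key": "movie_3", "distance": 0.25},
]

for threshold in (0.2, 0.222, 0.3):
    kept = filter_by_max_distance(rows, threshold)
    print(threshold, [row["key"] for row in kept])
```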

Date and Time Handling

The section above notes the use of Qarbine pragmas to manipulate the answer set shape. The main Data Source Designer guide lists the various pragmas available. Of particular interest for S3 Vectors is the convertToDate pragma because S3 Vectors metadata has no date type.

Date oriented values can be stored as numbers (seconds or milliseconds since epoch) or perhaps as ISO formatted strings. The downstream analysis likely wants genuine date objects though for formula use and presentation formatting. Consider a movie object such as

{
key: "movie_115",
metadata: {
title: "The Summer Dayz",
releaseDate: 1579046400000  // January 15, 2020
}
}

Using this sequence of pragmas

#pragma pullFieldsUp metadata  ← Not needed when using S3VectorsSQL
#pragma convertToDate releaseDate

the resulting object would look like

{
key: "movie_115",
title: "The Summer Dayz",
releaseDate: a JavaScript Date with January 15, 2020 UTC
}
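The conversion itself is simple arithmetic on the epoch value. The Python sketch below shows one way to turn the stored forms mentioned above (seconds, milliseconds, or an ISO string) into a real date object. It is not the pragma's implementation; the function name and the rule for telling seconds from milliseconds are assumptions.

```python
from datetime import datetime, timezone
from typing import Union


def convert_to_date(value: Union[int, float, str]) -> datetime:
    """Convert an epoch number or ISO 8601 string to a UTC datetime."""
    if isinstance(value, str):
        # Accept a trailing "Z" on older Python versions as well.
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
        # Assumption: strings without an offset are taken to be UTC.
        return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)
    # Assumption: magnitudes of 1e11 or more are milliseconds, smaller
    # values are seconds. Millisecond values before 1973 would be misread.
    seconds = value / 1000 if abs(value) >= 1e11 else value
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(convert_to_date(1579046400000).isoformat())
```

The millisecond value from the movie object above lands on midnight UTC of January 15, 2020, which matches the annotation in the example.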

Dynamic Embedding Prerequisites

Prior to using Qarbine’s embeddings(...) macro function or the SQL-like query function nearText(...), the Qarbine Administrator must first configure “AI Assistant(s)”. The AI Assistants provide access to various popular Generative AI services and are referenced using an alias. Check with your Qarbine administrator for which ones are available and their proper use. For example, when using dynamic query vector embeddings, the model used by the AI Assistant must be compatible with the one used to generate the original embedding values in the database.

Filtering Vectors

S3 Vectors filtering is similar to SQL WHERE clauses in concept, but not at all in syntax! You can attach metadata (for example, year, author, genre, and location) as key-value pairs to your vectors. Nesting provides AND and OR logic as well. By default, all metadata is filterable unless you explicitly specify it as non-filterable. You can use filterable metadata to filter your query results based on specific attributes, enhancing the relevance of your queries. Vector indexes support string, number, boolean, and list types of metadata.

For more information see
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-vectors-metadata-filtering.html.

The supported filter operators are listed below.

Operator | Value Type | Description
$eq | String, Number, Boolean | Exact match comparison for single values. When comparing with an array metadata value, returns true if the input value matches any element in the array. For example, {"category": {"$eq": "documentary"}} would match a vector with metadata "category": ["documentary", "romance"].
$ne | String, Number, Boolean | Not equal comparison
$gt | Number | Greater than comparison
$gte | Number | Greater than or equal comparison
$lt | Number | Less than comparison
$lte | Number | Less than or equal comparison
$in | Non-empty array of primitives | Match any value in array
$nin | Non-empty array of primitives | Match none of the values in array
$exists | Boolean | Check if field exists
$and | Non-empty array of filters | Logical AND of multiple conditions
$or | Non-empty array of filters | Logical OR of multiple conditions

The filtering syntax uses a JSON-based structure resembling the filter syntax of many NoSQL databases. Some examples are:

{"genre": "documentary"}
{"genre": {"$eq": "documentary"} }
{"year": {"$gt": 2019} }
{"$and": [ {"category": "technology"}, {"version": {"$gte": 1.0} } ] }
{"$or": [ {"genre": "drama"}, {"year": {"$gte": 2020} } ] }

As you can see, the standard S3 Vectors filtering syntax can be a bit verbose. Below is an example of a query specification using a filter.

{
vectorBucketName: 'media-entertainment',
indexName: "movies",
queryVector: {float32: [! sampleEmbeddings('funny') !] },
topK: 10,
returnDistance: true,
returnMetadata: true,
filter: {"year": {"$gte": 1996 } }
}

A more complex filter argument is

{"$and": [ {"rated": {"$eq": "PG13"}} , {"year": {"$gte": 1996} } ] }
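When drafting a filter it can help to check it against a sample metadata object before running a query. The Python sketch below evaluates the operators from the table locally. It is a rough emulation for experimentation, not the service's implementation; the function names, the sample movie, and the treatment of missing fields are assumptions.

```python
import operator
from typing import Any, Dict

_MISSING = object()
_ORDERING = {"$gt": operator.gt, "$gte": operator.ge,
             "$lt": operator.lt, "$lte": operator.le}


def _test(op: str, actual: Any, expected: Any) -> bool:
    if op == "$exists":
        return (actual is not _MISSING) == expected
    if actual is _MISSING:
        # Assumption: a missing field only satisfies the negative operators.
        return op in ("$ne", "$nin")
    # An array metadata value matches if any element matches.
    values = actual if isinstance(actual, list) else [actual]
    if op == "$eq":
        return any(v == expected for v in values)
    if op == "$ne":
        return all(v != expected for v in values)
    if op == "$in":
        return any(v in expected for v in values)
    if op == "$nin":
        return all(v not in expected for v in values)
    if op in _ORDERING:
        return any(_ORDERING[op](v, expected) for v in values)
    raise ValueError(f"unsupported operator: {op}")


def matches(metadata: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Return True if the metadata object satisfies the filter."""
    for field, condition in flt.items():
        if field == "$and":
            ok = all(matches(metadata, sub) for sub in condition)
        elif field == "$or":
            ok = any(matches(metadata, sub) for sub in condition)
        else:
            # A bare value such as {"genre": "documentary"} is shorthand for $eq.
            ops = condition if isinstance(condition, dict) else {"$eq": condition}
            actual = metadata.get(field, _MISSING)
            ok = all(_test(op, actual, exp) for op, exp in ops.items())
        if not ok:
            return False
    return True


# Hypothetical movie metadata.
movie = {"rated": "PG13", "year": 1999, "category": ["documentary", "romance"]}
print(matches(movie, {"$and": [{"rated": {"$eq": "PG13"}}, {"year": {"$gte": 1996}}]}))
```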

Qarbine SQL Oriented Querying (S3VectorsSQL)

Overview

Recall that S3 Vectors supports semantic (i.e. vector) search and a lexical (i.e. scalar/matching) search. The use of the raw filter syntax can be a bit verbose and cumbersome though. To improve readability and productivity when authoring retrievals, Qarbine provides a SQL oriented option. For example, the goal to retrieve up to 3 movies that have some similarity to “dracula” can look like the following using a query specification.

{
vectorBucketName: 'media-entertainment',
indexName: "movies",
"topK": 3,
"nearText": "dracula"
}

The Qarbine SQL equivalent is simply

SELECT * 
FROM movies
WHERE nearText("dracula") limit 3

Qarbine’s S3 Vectors integration goes much further though and extends to the filtering features as well. S3 Vectors still requires a JSON filter specification, but Qarbine acts as your co-pilot by translating SQL-oriented queries into their S3 Vectors equivalents. Adjusting our movie retrieval specification above to add the criterion that movies have a year of at least 1996 would be

SELECT * 
FROM movies
WHERE nearText("dracula") and year >= 1996 limit 3

The JSON filter equivalent is much more verbose and cumbersome to define. Qarbine allows you to avoid this frustration in many cases. You can always use the JSON structure though at any time.

When SQL is used the following pragma is automatically enabled

#pragma pullFieldsUp metadata

In some cases the Qarbine Data Source will have literally just the SQL statement above and nothing more. There are techniques to blend the ease of using SQL along with the powerful features of S3 Vectors within a Qarbine JSON specification object. The table below lists the fields that drive this definition.

JSON Field | Description
sql | The SQL statement can affect all of the options listed above.
sqlWhere | The string can affect the filter and queryVector arguments.

Here is a simple example of combining the SQL and query specification approaches. The effective result is the same as the example query specification above.

{
"sql": "select * from movies limit 3",
"nearText": "dracula"
}

Note that a SQL numeric list is enclosed in parentheses while one in the specification is enclosed in brackets. That is a subtle nuance across the SQL and JSON syntax standards.

SQL Clauses

The mapping of the standard SQL clauses to their S3 Vectors equivalents is described below.

Clause | Description
SELECT | The names of the fields to return. Specifying “*” indicates all object fields; in that case the returnProperties field is not set and S3 Vectors returns all of the properties. You can also reference individual metadata properties, in which case the list of names is passed as the returnProperties value. Here are some examples.
    SELECT * …
    SELECT title, rated …
    SELECT title, rated, plot, distance …
    SELECT *,data …
    Including “data” in the SELECT list sets the returnData field of the query specification to true. Including “distance” sets the returnDistance field to true.
    Note: to reference the key column, enclose it in single quotes.
FROM | The name of the index. This value sets the “indexName” field in the query specification. If the Qarbine data service does not have a specific bucket set, then use the “bucket.index” syntax for the SQL table reference. This sets both the vectorBucketName and indexName in the query specification.
WHERE | See the discussion below. The effect is mainly to set the “filter” field of the query specification.
ORDER BY | This clause is not consumable by S3 Vectors. S3 Vectors always returns results in closest-distance order. To apply any other sort, use Qarbine pragmas as discussed in the main Data Source Designer guide.
LIMIT | Indicates at most how many elements to return. This sets the “topK” field of the query specification.
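The FROM handling can be sketched as a small string split. The Python below is only an illustration of the “bucket.index” rule described in the table; the function name, the default_bucket parameter standing in for a bucket set on the data service, and the choice to split at the first dot are assumptions.

```python
from typing import Dict, Optional


def split_table_reference(reference: str,
                          default_bucket: Optional[str] = None) -> Dict[str, str]:
    """Map a SQL table reference to vectorBucketName and indexName."""
    # Strip surrounding quotes such as 'media-entertainment.movies'.
    name = reference.strip().strip("'\"")
    # Assumption: the first dot separates the bucket from the index.
    bucket, dot, index = name.partition(".")
    if not dot:
        if default_bucket is None:
            raise ValueError("no bucket given and no default bucket configured")
        bucket, index = default_bucket, name
    if not bucket or not index:
        raise ValueError(f"malformed table reference: {reference!r}")
    return {"vectorBucketName": bucket, "indexName": index}


print(split_table_reference("'media-entertainment.movies'"))
print(split_table_reference("movies", default_bucket="media-entertainment"))
```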

Bear in mind that some combinations of query fields may not make sense in the S3 Vectors world.

SQL Functions

The following Qarbine-defined SQL functions and column aliases have no counterpart in the query specification's filter value; they set other fields of the specification instead.

nearText(aPhrase [, anAlias])
nearVector(number1, ..., numberN)
vector = (number1, ..., numberN) | A different way of expressing nearVector()
data = (number1, ..., numberN) | The other column alias

Qarbine automatically converts any queryVector value which is an array of numbers into “{float32: [...] }” format.

The WHERE clause criteria can be in a variety of traditional SQL forms and may include Qarbine specific functions described below.

Function | Description
nearVector | This clause is removed from the WHERE criteria and its list of numbers is set into the “queryVector” field of the query specification.
keyIn | This clause is removed from the WHERE criteria and its list of keys is set into the “keys” field of the query specification.
nearText | This clause is removed from the WHERE criteria and its argument is set into the “nearText” field of the query specification.
withOption | Pass in the specification field name and the value to set. This clause is removed from the WHERE clause. See the section below for more details.
withOptions | Set several specification fields at once. The format is withOptions(key1, value1, keyN, valueN). The key argument may use dot notation when setting the inner value of a component object. See the section below for more details.

An SQL example is

SELECT * 
FROM movies
WHERE nearText("dracula")

This results in a query specification with these fields

indexName: "movies",
nearText: "dracula"

The nearText value will be used with a Qarbine AI Assistant to obtain an embedding which is then used as the queryVector value. The default AI Assistant will be used in this case. To use a specific one, add a second argument to nearText as shown below.

SELECT * 
FROM movies
WHERE nearText("dracula", "myAiAssistantAlias")

A more constraining query from our initial one is

SELECT * 
FROM movies
WHERE nearText("dracula") and rated = "PG"

The above references the rated property as a SQL column. The equivalent specification is

 {
"vectorBucketName": "media-entertainment",
"indexName": "movies",
"filter": { "rated": "PG" },
"returnMetadata": true,
"returnDistance": true,
"nearText": "dracula"
}

Here is a query to return all the properties and the embedding data as well.

SELECT *, data 
FROM movies
LIMIT 2

Compound WHERE Clause Criteria

Compound criteria are permitted. For example, this SQL

SELECT *, data
FROM movies
WHERE
withOption('returnDistance', false)
and runtime < 120
and rated = 'R'
and nearVector (...)

generates this underlying query specification

{
vectorBucketName: "media-entertainment",
indexName: "movies",
filter: {
$and: [
{ runtime: { $lt: 120 } },
{rated: "R"}
]
},
returnMetadata: true,
returnDistance: false,
returnData: true,
queryVector: [...],
topK: 30
}
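The filter portion of this translation can be approximated in a few lines. The Python sketch below handles only simple comparisons joined by AND and is not Qarbine's translator; the function names are hypothetical, and OR, parentheses, functions such as nearVector, and string values containing the word "and" are deliberately out of scope.

```python
import re
from typing import Any, Dict

_SQL_TO_FILTER_OP = {"!=": "$ne", "<>": "$ne", ">": "$gt",
                     ">=": "$gte", "<": "$lt", "<=": "$lte"}
_COMPARISON = re.compile(r"^\s*(\w+)\s*(>=|<=|<>|!=|=|>|<)\s*(.+?)\s*$")


def _literal(text: str) -> Any:
    """Convert a SQL literal to a Python value (strings, booleans, numbers)."""
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        return text[1:-1]
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    return float(text) if "." in text else int(text)


def comparison_to_filter(comparison: str) -> Dict[str, Any]:
    match = _COMPARISON.match(comparison)
    if not match:
        raise ValueError(f"unsupported comparison: {comparison!r}")
    field, op, raw = match.groups()
    value = _literal(raw)
    # Equality uses the short form, as in the generated specification above.
    if op == "=":
        return {field: value}
    return {field: {_SQL_TO_FILTER_OP[op]: value}}


def where_to_filter(where: str) -> Dict[str, Any]:
    # Assumption: only AND-joined simple comparisons are present.
    parts = re.split(r"\s+and\s+", where.strip(), flags=re.IGNORECASE)
    filters = [comparison_to_filter(part) for part in parts]
    return filters[0] if len(filters) == 1 else {"$and": filters}


print(where_to_filter("runtime < 120 and rated = 'R'"))
```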

Retrieving a Specific Vector

With AWS S3 Vectors, you cannot directly retrieve a vector's full data by searching for an exact match of its vector values. Instead, you retrieve vectors by running a similarity search: you submit a query vector, and the system returns the top K most similar vectors (based on cosine or Euclidean distance), possibly with their data, distance scores, and metadata if requested.

The query specification would be

{
vectorBucketName: 'media-entertainment',
indexName: "movies",
queryVector: [...],
topK: 1,
returnDistance: true,
returnMetadata: true,
}

The SQL equivalent would be

SELECT * FROM movies WHERE vector = (...)

or using the other column alias

SELECT * FROM movies WHERE data = (...)

Specify a LIMIT clause to retrieve objects near the given vector. Otherwise an implied “LIMIT 1” will be set into the query specification. For the former, the nearVector() function is semantically better.

Retrieving Specific Keys

The SQL statement’s WHERE clause may refer to one or more specific keys. Only equality checking is allowed. The snippet

"key" = 'movie_11'

is effectively

keyIn('movie_11')

The snippet

"key" = 'movie_11' OR "key" = 'movie_70'

is effectively

keyIn('movie_11',  'movie_70')

All ‘"key" = XX’ snippets are processed as an additional keyIn value no matter the conjunction (OR vs. AND). Notice the double quotes around "key" to avoid a keyword conflict.
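The key collection rule can be pictured with a regular expression that gathers every "key" = '...' snippet regardless of the conjunction between them. This Python sketch is an illustration only; the function name and the removal of duplicate keys are assumptions.

```python
import re
from typing import List

# Matches "key" = 'some value' with optional spaces around the equals sign.
_KEY_EQUALS = re.compile(r"""["]key["]\s*=\s*'([^']*)'""")


def extract_keys(where: str) -> List[str]:
    """Collect the keys named in a WHERE clause, in order of appearance."""
    keys: List[str] = []
    for key in _KEY_EQUALS.findall(where):
        # Assumption: a key mentioned twice is only requested once.
        if key not in keys:
            keys.append(key)
    return keys


print(extract_keys('"key" = \'movie_11\' OR "key" = \'movie_70\''))
```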

Setting Other Query Property Fields

These WHERE clause functions provide a way to set query specification properties which do not map well to the nature of SQL but satisfy the syntax of SQL. The functions are:

withOption(name, value);
withOptions(name1, value1, nameN, valueN);

Here is an example

withOption('returnDistance', false)

If the value is a string that looks like JSON (it starts with “[” or “{”), it is parsed first and the resulting array or object is set as the property in the query specification.
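The value conversion and the dot notation for names can be sketched together in Python. This is an approximation of the described behavior, not Qarbine's code; the function names and the fallback for text that fails to parse are assumptions.

```python
import copy
import json
from typing import Any, Dict


def coerce_option_value(value: Any) -> Any:
    """Parse strings that look like a JSON array or object; pass others through."""
    if isinstance(value, str) and value.lstrip().startswith(("[", "{")):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            # Assumption: text that fails to parse is kept as a plain string.
            return value
    return value


def with_option(spec: Dict[str, Any], name: str, value: Any) -> Dict[str, Any]:
    """Return a copy of spec with name (dot notation allowed) set to value."""
    result = copy.deepcopy(spec)
    target = result
    *parents, leaf = name.split(".")
    for part in parents:
        # Create intermediate objects as needed for dotted names.
        target = target.setdefault(part, {})
    target[leaf] = coerce_option_value(value)
    return result


spec = {"indexName": "movies"}
print(with_option(spec, "returnDistance", False))
print(with_option(spec, "filter", '{"year": {"$gte": 1996}}'))
```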

Reviewing the Generated Specification

You can enter criteria of the form “EXPLAIN SELECT ….” to have the SQL statement processed and have the returned answer set be the underlying query specification. For example enter and run

explain
SELECT * FROM movies
WHERE year >= 1995
LIMIT 10

Select the single result element and its details are shown to the right. Click the “+” to expand all of the JSON object fields.

  

A convenient way of specifying this is to have “explain” on the first line and the rest of your SQL on the next lines.

explain
SELECT * from movies
WHERE year >= 1995
LIMIT 10

Then simply “comment out” the first line when not in use

// explain
SELECT * from movies
WHERE year >= 1995
LIMIT 10

You can also use “explain: true” in the JSON query specification for similar information.
Another way to get the specification is to press ALT and click   . For the query

SELECT * from movies 
WHERE year >= 1995
LIMIT 10

the result is shown below.

  

Any “explain SELECT” or “explain: true” takes precedence over the ALT-click interaction.

Qarbine Virtual Queries

There are a few convenience queries which are mainly DBA oriented. These queries are recognized by the Qarbine driver and provide common database information. Any bucket set in the data service definition constrains what is returned. For example, if a bucket (generically a “database”) is given in the data service, then only objects in that one bucket are returned.

These virtual query defaults are independent of whatever drop down option is chosen in the Data Source Designer tool. If a specific schema’s information is wanted for example, it must be explicitly given.

Query | Description
list buckets | Return a list of visible vector buckets.
describe buckets | Return the details of all the visible buckets.
describe bucket BUCKET | Return the details of the given bucket.
list indexes [BUCKET] | Return a list of indexes. The optional argument may be a bucket name.
describe indexes [BUCKET] | Provide details on all of the indexes. This may take a while depending on your bucket structure.
describe index INDEX | Provide details on the given index.

References

More information on S3 Vectors can be found at https://aws.amazon.com/s3/features/vectors/ and https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-vectors.html.